Provide an overview of the project goals and motivation
The motivation for a project exploring the correlation between low birth weight, premature birth rate, cancer mortality rate, cardiovascular disease hospitalization rate, asthma hospitalization rate and PM2.5 concentration, as well as socioeconomic, racial, education, and gender factors across the 62 counties of New York State, stems from a compelling public health imperative. All the outcomes of interest are critical indicators of population health. By examining the impact of fine particulate pollution, this project aims to uncover environmental health disparities that disproportionately affect under-served communities. It seeks to inform public health policy by highlighting the need for targeted interventions that could mitigate the adverse effects of air pollution on vulnerable populations, including pregnant women and newborns. The analysis of socioeconomic and racial factors will offer insights into the complex interplay between environment and social determinants of health, potentially guiding urban planning and healthcare resource allocation. Furthermore, by considering gender-specific vulnerabilities, the project aims to provide a comprehensive overview of the risk landscape, fostering community awareness and prompting action to improve air quality and health equity in one of the most densely populated urban areas in the world.
What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?
Source, scraping method, cleaning, etc.
pm2_5 <- pm2_5 %>%
janitor::clean_names()%>%
select (county,value) %>%
rename (annual_pm2.5 = "value")
The dataset is obtained from EPH Tracking Website from CDC (https://ephtracking.cdc.gov/DataExplorer/). This has 62
rows and 2 columns of data. In which, 2 variables are:
- county: NY county name
- annual_pm2.5: annual estimated PM2.5 concentration at
each county (ug/m^3)
The following set of demographic datasets is obtained from Census Reporter webpage (https://censusreporter.org/).
edu <- edu_NY %>%
janitor::clean_names()%>%
mutate (percentage_high_education = (bach_male+master_male+prof_male+doct_male+bach_female+master_female+prof_female+doct_female)/total) %>%
filter(str_detect (name, " County, NY")) %>%
mutate(county = str_replace (name, " County, NY", "")) %>%
select(county,percentage_high_education) %>%
mutate(county = str_replace (county, "St.", "St")) %>%
mutate(county = str_replace (county, "Stuben", "Steuben"))%>%
select(county,percentage_high_education)
This has 62 rows and 2 columns of data. In which, 2 variables
are:
- county: NY county name
- percentage_high_education: percentage of population who
finished a higher education level (higher than a bachelor degree)
race_non_hisp_white <- race_NY %>%
janitor::clean_names()%>%
filter( x1 == "White Non-Hispanic") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_non_hisp_white") %>%
select (county,percent_non_hisp_white) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_non_hisp_white ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
race_non_hisp_black <- race_NY %>%
janitor::clean_names()%>%
filter( x1 == "Black Non-Hispanic") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_non_hisp_black") %>%
select (county,percent_non_hisp_black) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_non_hisp_black ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
race_hisp_white <- race_NY %>%
janitor::clean_names()%>%
filter( x1 == "White Hispanic") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_hisp_white") %>%
select (county,percent_hisp_white) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_hisp_white ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
race_hisp_black <- race_NY %>%
janitor::clean_names()%>%
filter( x1 == "Black Hispanic") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_hisp_black") %>%
select (county,percent_hisp_black) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_hisp_black ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
race_merge <- race_non_hisp_white %>%
inner_join(race_non_hisp_black, by = "county") %>%
inner_join(race_hisp_white, by = "county") %>%
inner_join(race_hisp_black, by = "county")
race_merge <- race_merge %>%
mutate (percent_other = 1 - percent_non_hisp_white - percent_non_hisp_black - percent_hisp_white - percent_hisp_black)
This has 62 rows and 6 columns of data. In which, 6 variables
are:
- county: NY county name
- percent_non_hisp_white: percentage of population who are
identified as Non-Hispanic White
- percent_non_hisp_black: percentage of population who are
identified as Non-Hispanic Black
- percent_hisp_white: percentage of population who are
identified as Hispanic White
- percent_hisp_black: percentage of population who are
identified as Hispanic Black
- percent_other: percentage of population who are
identified as any other ethinic groups
income <- HHincome_NY %>%
janitor::clean_names() %>%
filter (x1 == "percent_high_income") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_high_income") %>%
select (county,percent_high_income) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_high_income ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
This has 62 rows and 2 columns of data. In which, 2 variables
are:
- county: NY county name
- percent_high_income: percentage of population who are at
higher income households (>$75,000 annually)
age <- Age_NY %>%
janitor::clean_names() %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "median_age") %>%
select (county,median_age) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,median_age ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
This has 62 rows and 2 columns of data. In which, 2 variables
are:
- county: NY county name
- median_age: median age of the population at each
county
sex <- Sex_NY %>%
janitor::clean_names() %>%
filter (x1 == "Male:") %>%
pivot_longer(
albany_county_ny : yates_county_ny,
names_to = "county",
values_to = "percent_male") %>%
select (county,percent_male) %>%
separate(county, into = c("county", "x"), sep = "_county_") %>%
select (county,percent_male ) %>%
mutate(county = str_replace (county, "_", " ")) %>%
mutate_at(vars(county), str_to_title)
This has 62 rows and 2 columns of data. In which, 2 variables
are:
- county: NY county name
- percent_male: percentage of population who are identified
as male
lowbirthweight <- lowbirthweight %>%
janitor::clean_names()%>%
select (region_county, percentage)%>%
rename (county = "region_county", percent_lowbirthweight = "percentage")
This has 87 rows and 2 columns of data. In which, 2 variables
are:
- county: NY county name
- percent_lowbirthweight: percentage of children being born
being identified as low birth weight (<2,500g) (https://www.health.ny.gov/)
The following dataset is obtained from New York State Department of Health (https://www.health.ny.gov/) that contains 4 different health indicators that will complement with the low birthweight. These 5 will act as our outcomes of interest for our regression models.
health <- health %>%
janitor::clean_names()%>%
select (geographic_area,indicator_title,topic_area,rate_percent,measurement) %>%
filter( str_detect (geographic_area, " County")) %>%
mutate (county = str_replace (geographic_area, " County", "")) %>%
select (county, everything()) %>%
select (-geographic_area) %>%
filter(topic_area == "Cancer Indicators" | topic_area == "Respiratory Disease Indicators" | topic_area == "Cardiovascular Disease Indicators" | topic_area == "Maternal and Infant Health Indicators")
cancer <- health %>%
filter (topic_area == "Cancer Indicators") %>%
filter (indicator_title == "All cancer incidence rate per 100,000") %>%
select (-c(indicator_title, topic_area, measurement)) %>%
rename (cancer_mortality_per_100k = "rate_percent")
resp <- health %>%
filter (topic_area == "Respiratory Disease Indicators") %>%
filter (indicator_title == "Asthma hospitalization rate per 10,000") %>%
select (-c(indicator_title, topic_area, measurement)) %>%
rename (asthma_hosp_rate_per_10k = "rate_percent")
cardio <- health %>%
filter (topic_area == "Cardiovascular Disease Indicators") %>%
filter (indicator_title == "Cardiovascular disease hospitalization rate per 10,000") %>%
select (-c(indicator_title, topic_area, measurement)) %>%
rename (cardio_hosp_rate_per_10k = "rate_percent")
maternal <- health %>%
filter (topic_area == "Maternal and Infant Health Indicators") %>%
filter (indicator_title == "Percentage of premature births with <37 weeks gestation") %>%
select (-c(indicator_title, topic_area, measurement)) %>%
rename (premature_percentage = "rate_percent")
They are:
- cancer_mortality_per_100k: percentage of cancer mortality
per 100 thousands people in each NY county (Cancer Indicator)
- asthma_hosp_rate_per_10k: percentage of asthma
hospitalization per 10 thousands people in each NY county (Respiratory
Disease Indicator)
- cardio_hosp_rate_per_10k“: percentage of
cardiovascular-disease-related hospitalization per 10 thousands people
in each NY county (Cardiovascular Disease Indicator)
- premature_percentage: percentage of children being born
prematurely (<37 gestational weeks) in each NY county
Here we perform inner_join() to create 2 bigger datasets
health_merge and demographic_merge. Then, we
join them with our lowbirthweight & pm2_5
tp make a finalized data frame called merge. And, we will
use this for regression model.
health_merge <- maternal %>%
inner_join(cancer, by = "county") %>%
inner_join(resp, by = "county") %>%
inner_join(cardio, by = "county")
demographic_merge <- age %>%
inner_join(sex, by = "county") %>%
inner_join(income, by = "county") %>%
inner_join(race_merge, by = "county") %>%
inner_join(edu, by = "county") %>%
mutate(county = str_replace (county, "St ", "St. "))
merge <- lowbirthweight %>%
inner_join(pm2_5, by = "county") %>%
inner_join(health_merge, by = "county") %>%
inner_join(demographic_merge, by = "county")
merge <- merge %>%
select (county, annual_pm2.5,everything())%>%
mutate(asthma_hosp_rate_per_10k = as.numeric(asthma_hosp_rate_per_10k))%>%
mutate(cardio_hosp_rate_per_10k = as.numeric(cardio_hosp_rate_per_10k)) %>%
mutate(premature_percentage = as.numeric(premature_percentage))%>%
mutate(cancer_mortality_per_100k = as.numeric(cancer_mortality_per_100k))
uscounties <- uscounties %>%
filter (state_id == "NY") %>%
select (county, lat, lng)
map <- merge %>%
inner_join(uscounties, by ="county")
write.csv(map, "NY_map.csv", row.names = FALSE)
This file is specifically made for the purpose of making map (in Shiny App).
Visualizations, summaries, and exploratory statistical analyses. Justify the steps you took, and show any major changes to your ideas.
merge %>%
plot_ly(
x = ~reorder(county, percent_high_income),
y = ~percent_high_income,
type = "bar",
marker = list(color = "red1")
) %>%
layout(
title = "Percentage of High Income",
xaxis = list(title = "County Name", categoryorder = "total descending"),
yaxis = list(title = "Percentage"),
barmode = "stack"
)
race_plot <- merge %>%
select (county, percent_non_hisp_white, percent_non_hisp_black, percent_hisp_white, percent_hisp_black, percent_other) %>%
pivot_longer(
cols = starts_with("percent_"), names_to = "race", values_to = "percentage")
race_plot%>%
plot_ly(x = ~county, y = ~percentage, type = "bar",color = ~race,colors = "RdYlGn", hoverinfo = "y+name") %>%
layout(barmode = "stack",
title = "Racial Composition in NY county",
xaxis = list(title = "County"),
yaxis = list(title = "Percentage (%)"))
LBR_graph <- merge %>%
drop_na() %>%
plot_ly(
x = ~county,
y = ~percent_lowbirthweight,
color = ~county,
type = "scatter",
mode = "markers"
) %>%
layout(
title = "Rate of Low Birthweight",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Percentage of Low Birthweight"),
showlegend = FALSE
)
premature_graph <- merge %>%
drop_na() %>%
plot_ly(
x = ~county,
y = ~premature_percentage,
color = ~county,
type = "scatter",
mode = "markers"
) %>%
layout(
title = "Rate of Premature Birth",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Rate of Premature Birth"),
showlegend = FALSE
)
asthma_graph <- merge %>%
drop_na() %>%
plot_ly(
x = ~county,
y = ~asthma_hosp_rate_per_10k,
color = ~county,
type = "scatter",
mode = "markers"
) %>%
layout(
title = "Rate of Asthma Hospitalization",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Rate per 10k"),
showlegend = FALSE
)
cancer_graph <- merge %>%
drop_na() %>%
plot_ly(
x = ~county,
y = ~cancer_mortality_per_100k,
color = ~county,
type = "scatter",
mode = "markers"
) %>%
layout(
title = "Rate of Cancer Mortality",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Rate per 100k"),
showlegend = FALSE
)
cardio_graph <- merge %>%
drop_na() %>%
plot_ly(
x = ~county,
y = ~cardio_hosp_rate_per_10k,
color = ~county,
type = "scatter",
mode = "markers"
) %>%
layout(
title = "Rate of Cardiovascular Disease Hospitalization",
xaxis = list(title = "County", tickangle = 90),
yaxis = list(title = "Rate per 10k"),
showlegend = FALSE
)
If you undertake formal statistical analyses, describe these in detail
After cleaning the data, we then look at the effect that the annual concentration of pm 2.5 separately has on multiple outcomes among 62 counties in New York, including low birth weight rate, premature birth rate, cancer mortality rate, asthma-related hospitalization rate, and cardiovascular-disease-related hospitalization rate.
Since all of the outcomes are measured as rates (i.e., continuous), we use linear regression models to assess the effect of annual concentration of pm 2.5 on the outcomes of interests, adjusting for age, sex, ethnicity, education, income. In our models:
Age is reported as a continuous variable that reflects the median age of each county.
Sex is reported as a continuous variable that reflects the percentage of male in each county. The variable that reflects the percentage of female in each county is not included in our model to avoid perfect multicollinearity.
Ethnicity is reported as a continuous variable that reflects the percentage of different ethnicity groups in each county, including Hispanic-Black, Hispanic-White, non-Hispanic-Black, non-Hispanic-White, and Others (if Asian, Native American, or belong to two or more race). The variable that reflects the percentage of people who belong to the category Others is not included in our model to avoid perfect multicollinearity.
Education is reported as a continuous variable that reflects the percentage of people who have obtained one or more higher education degrees in each county.
Income is reported as a continuous variable that reflects the percentage of household of each county that have annual income exceeding $75,000.
For each of the five outcomes of interest, we first start with the full model:
Outcome = Annual pm 2.5 concentration + median age + percentage of male + percentage of Hispanic-Black + percentage of Hispanic-White + percentage of non-Hispanic-Black + percentage of non-Hispanic-White + percentage of household income exceeding 75,000 USD + percentage of people obtained higher education
We then perform bidirectional stepwise selection on all five models and output the best models. The full model and the resulting best models after stepwise selection for each outcome are detailed below:
lm_pm2.5_birthweight_adjusted <-lm(percent_lowbirthweight ~ annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
lm_pm2.5_birthweight_adjusted_best <- step(lm_pm2.5_birthweight_adjusted, direction = 'both', trace=FALSE)
summary(lm_pm2.5_birthweight_adjusted_best)%>%
tab_model()
| percent lowbirthweight | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 11.53 | 7.18 – 15.88 | <0.001 |
| median age | -0.11 | -0.21 – -0.02 | 0.024 |
| percent non hisp black | 8.05 | 1.34 – 14.76 | 0.019 |
| Observations | 62 | ||
| R2 / R2 adjusted | 0.274 / 0.249 | ||
Interpretation
The best model that was outputted by the stepwise regression is
$ = 11.53 - 0.11() + 8.05 ()$
On average, for counties whose median age is 0 and has zero percent of non-Hispanic-Black, the expected low birth weight rate is 11.5% (no practical meaning). Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the intercept is significantly different than 0.
On average, for every unit increase of median age (1 year), the expected low birth weight rate of the county decreases by 0.11%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.
On average, for every unit increase in the percentage of non-Hispanic-Black, the expected low birth weight rate of the county increases by 8.05%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.
The adjusted R-squared of 0.249 implies that 24.9% of the variation in the response variable can be explained by its linear relationship with the set of the 2 predictors (median age and percentage of non-Hispanic-Black).
lm_pm2.5_premature_adjusted <-lm(premature_percentage ~ annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
lm_pm2.5_premature_adjusted_best <- step(lm_pm2.5_premature_adjusted, direction = 'both', trace=FALSE)
summary(lm_pm2.5_premature_adjusted_best)%>%
tab_model()
| premature percentage | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 4.88 | 0.96 – 8.80 | 0.016 |
| median age | 0.08 | -0.01 – 0.17 | 0.067 |
| percent non hisp black | 8.03 | 2.63 – 13.42 | 0.004 |
| Observations | 61 | ||
| R2 / R2 adjusted | 0.135 / 0.105 | ||
Interpretation
The best model that was outputted by the stepwise regression is
$ = 4.88 + 0.08() + 8.03 ()$
On average, for counties whose median age is 0 and has zero percent of non-Hispanic-Black, the expected low birth weight rate is 4.88% (no practical meaning). Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the intercept is significantly different than 0.
On average, for every unit increase of median age (1 year), the expected low birth weight rate of the county decreases by 0.11%. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.
On average, for every unit increase in the percentage of non-Hispanic-Black, the expected low birth weight rate of the county increases by 8.03%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.
The adjusted R-squared of 0.105 implies that 10.5% of the variation in the response variable can be explained by its linear relationship with the set of the 2 predictors (median age and percentage of non-Hispanic-Black).
lm_pm2.5_cancer_adjusted <-lm(cancer_mortality_per_100k ~annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
lm_pm2.5_cancer_adjusted_best<-step(lm_pm2.5_cancer_adjusted, direction = 'both', trace=FALSE)
summary (lm_pm2.5_cancer_adjusted_best)%>%
tab_model()
|
cancer mortality per 100 k |
|||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | -1.85 | -517.25 – 513.56 | 0.994 |
| annual pm2 5 | 13.27 | 0.99 – 25.54 | 0.035 |
| median age | 13.14 | 9.82 – 16.46 | <0.001 |
| percent high income | -213.49 | -373.22 – -53.75 | 0.010 |
| percent non hisp white | 464.03 | 200.92 – 727.14 | 0.001 |
| percent non hisp black | 536.28 | 22.68 – 1049.88 | 0.041 |
| percent hisp white | 1023.82 | 103.51 – 1944.13 | 0.030 |
| percent hisp black | -1787.43 | -4172.76 – 597.90 | 0.139 |
| percent male | -574.15 | -1409.17 – 260.87 | 0.174 |
| Observations | 61 | ||
| R2 / R2 adjusted | 0.809 / 0.780 | ||
Interpretation
The best model that was outputted by the stepwise regression is
$ = -1.85 + 13.27() + 13.14() - 213.49() + 464.03() + 536.28() + 1023.82() - 1787.43() - 574.15()$
On average, for counties whose annual pm 2.5 concentration is 0, median age is 0, has zero percent of non-Hispanic-Black, has zero percent of non-Hispanic-White, has zero percent of Hispanic-Black, has zero percent of Hispanic-White, has zero percent of male, the expected low birth weight rate is -1.85% (no practical meaning). However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the intercept is significantly different than 0.
On average, for every unit increase of median age (1 year), the expected low birth weight rate of the county decreases by 0.11%. However, since the p-value is larger than 0.05, we do not have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.
On average, for every unit increase in the percentage of non-Hispanic-Black, the expected low birth weight rate of the county increases by 8.03%. Since the p-value is smaller than 0.05, we have sufficient evidence to claim that the beta coefficient for median age is significantly different than 0.
The adjusted R-squared of 0.105 implies that 10.5% of the variation in the response variable can be explained by its linear relationship with the set of the 2 predictors (median age and percentage of non-Hispanic-Black).
lm_pm2.5_asthma_adjusted <-lm(asthma_hosp_rate_per_10k ~ annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
lm_pm2.5_asthma_adjusted_best <- step(lm_pm2.5_asthma_adjusted, direction = 'both', trace=FALSE)
summary(lm_pm2.5_asthma_adjusted_best)%>%
tab_model()
| asthma hosp rate per 10 k | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 1.36 | 0.06 – 2.67 | 0.041 |
| percent high income | 4.65 | 0.32 – 8.99 | 0.036 |
| percent non hisp black | 13.52 | 5.77 – 21.26 | 0.001 |
| percent hisp black | 366.67 | 314.84 – 418.50 | <0.001 |
| percentage high education | -5.31 | -9.55 – -1.06 | 0.015 |
| Observations | 59 | ||
| R2 / R2 adjusted | 0.935 / 0.930 | ||
lm_pm2.5_cardio_adjusted <-lm(cardio_hosp_rate_per_10k ~annual_pm2.5 + median_age + percent_high_income + percent_non_hisp_white + percent_non_hisp_black + percent_hisp_white + percent_hisp_black + percentage_high_education + percent_male, data=merge)
lm_pm2.5_cardio_adjusted_best<-step(lm_pm2.5_cardio_adjusted, direction = 'both', trace=FALSE)
summary (lm_pm2.5_cardio_adjusted_best)%>%
tab_model()
| cardio hosp rate per 10 k | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 157.52 | -155.87 – 470.91 | 0.318 |
| annual pm2 5 | 6.97 | -0.14 – 14.08 | 0.055 |
| median age | 1.73 | 0.01 – 3.44 | 0.049 |
| percent non hisp white | 117.26 | -30.46 – 264.97 | 0.117 |
| percent non hisp black | 222.99 | -83.73 – 529.70 | 0.151 |
| percent hisp white | 439.56 | -49.52 – 928.63 | 0.077 |
| percentage high education | -148.61 | -224.74 – -72.49 | <0.001 |
| percent male | -405.53 | -920.48 – 109.43 | 0.120 |
| Observations | 62 | ||
| R2 / R2 adjusted | 0.329 / 0.242 | ||
What were your findings? Are they what you expect? What insights into the data can you make?